This repository contains all the code and data needed to reproduce the results in:

Blackwell and Olson. 2021. "Reducing Model Misspecification and Bias in the Estimation of Interactions." *Political Analysis*. 

Before running any of the scripts, please first run `00_setup_packages.R` to install the correct versions of all packages used in these analyses. In particular, the results depend on specific versions of some packages (especially `glmnet`).

The empirical applications are the most straightforward to reproduce. The `direct_primary.R` and `remittances.R` files can be run to generate Figures 5, 6, SM.9, and SM.10.

The simulation analyses are the most complex and computationally demanding. We originally ran these simulations on a cluster of ~100 nodes using the HTCondor scheduling software. Even on this cluster, the simulations took **more than one day to complete**. We estimate that reproducing them on a single workstation would take weeks at minimum and possibly months. For this reason, we have pre-populated the `output/` directory with our own simulation results, which can be used to generate the simulation figures (2, 3, 4, and SM.1-8) by running `main_simulations_plots.R` and `binary_simulations_plots.R`. The figures in the paper correspond to the following filenames (these scripts also produce some extra plots):

- Figure 1: `example-coefs.pdf` (from `figure_1.R`)
- Figure 2: `main-bias-sim.pdf` and `main-rmse-sim.pdf`
- Figure 3: `main-cov-sim.pdf`
- Figure 4: `dense-rmse-sim-all.pdf`
- Figure 5: `primary_plot.pdf`
- Figure 6: `remittances_coefplot.pdf`
- Figure SM.1: `main-bias-sim-all.pdf` and `main-rmse-sim-all.pdf`
- Figure SM.2: `dense-bias-sim-all.pdf` and `dense-cov-sim.pdf`
- Figure SM.3: `main-bias-sim-alt.pdf`
- Figure SM.4: `main-rmse-sim-alt.pdf` and `main-cov-alt-sim.pdf`
- Figure SM.5: `large_n-bias-sim-all.pdf` and `large_n-rmse-sim-all.pdf`
- Figure SM.6: `large_n-rmse-sim-alt.pdf` and `large_n-cov-sim-alt.pdf`
- Figure SM.7: `binary-bias-sim-all.pdf` and `binary-rmse-sim-all.pdf`
- Figure SM.8: `binary-cov-sim.pdf`
- Figure SM.9: `primary_plot_additional.pdf`
- Figure SM.10: `remittances_coefplot_additional.pdf`


For convenience, we have provided a `Makefile` that will run all code and produce all plots. For example, running `make direct_primary` will run the code to generate the direct primary plots. For the simulation plots, use the following targets:

- `make main_figs`: will produce the plots associated with the main, dense, and large-N settings.
- `make binary_figs`: will produce the plots associated with the binary DGP. 

If you want or need to reproduce the raw simulation output, first move or delete the CSV files in the `output/` directory. We recommend using the provided `Makefile` to run these simulations with the following commands:

- `make main`: will run the main simulations for Figures 2, 3, SM.1, SM.3, and SM.4
- `make dense`: will run the dense DGP simulations for Figures 4 and SM.2
- `make largen`: will run the large-N simulations for Figures SM.5 and SM.6
- `make binary`: will run the binary DGP simulations for Figures SM.7 and SM.8
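
Moving the pre-populated results aside, rather than deleting them, keeps our original output available for comparison. The following is a minimal sketch of one way to do that. The function name `backup_outputs` and the default backup directory `output_backup` are illustrative assumptions and are not part of this repository.

```shell
#!/bin/sh
# Move the CSV files out of the output directory instead of deleting them.
# `backup_outputs` and `output_backup` are hypothetical names for this sketch.
backup_outputs() {
  src="${1:-output}"          # directory holding the simulation CSV files
  dest="${2:-output_backup}"  # where to keep the copies (assumed name)
  mkdir -p "$dest"
  for f in "$src"/*.csv; do
    # Skip the unexpanded pattern when no CSV files are present.
    [ -e "$f" ] || continue
    mv "$f" "$dest"/
  done
}
```

Calling `backup_outputs` from the repository root moves every CSV file in `output/` into `output_backup/` and leaves any other files in place.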

If you want or need to run these simulations without using `make`, you can run the scripts directly at the command line using `R CMD BATCH`. The `main_simulations.R` and `binary_simulations.R` scripts each run the simulation for a particular choice of parameters, selected by command-line arguments to R. In order, these are:

- `run`: a number added to 12345 to set the seed. This is useful if you want to break the simulations into smaller batches and run them with different seeds.
- `n_units`: number of rows in the simulated data
- `R22`: partial R^2 for XV interaction on Y (see paper)
- `R21`: partial R^2 for XV interaction on D (see paper)
- `n_covs`: number of covariates in X
- `sims`: number of simulations
- `type`: DGP type (main, dense, sparse) 

For example, to produce the simulated results for a particular setup, we could run the following at the command line:

``` shell
R CMD BATCH --vanilla --no-save "--args 0 425 0.25 0.25 20 1000 main" main_simulations.R
```
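
Because the `run` argument offsets the seed, the same setup can be split into several batches. The sketch below only prints one `R CMD BATCH` command per batch, so the commands can be inspected before anything is executed. The function name `sim_batch_commands`, the number of batches, the per-batch `sims` value, and the per-run log file names are illustrative assumptions. Whether each batch writes a distinct CSV file depends on how the script names its output, which this sketch does not assume.

```shell
#!/bin/sh
# Print one R CMD BATCH command per batch, varying only the `run` argument.
# `sim_batch_commands` is a hypothetical helper, not part of the repository.
sim_batch_commands() {
  n_batches="$1"  # number of batches (illustrative choice)
  sims="$2"       # simulations per batch (illustrative choice)
  run=0
  while [ "$run" -lt "$n_batches" ]; do
    # Argument order: run n_units R22 R21 n_covs sims type
    # The last argument names a separate log file per run (assumed naming).
    echo "R CMD BATCH --vanilla --no-save \"--args $run 425 0.25 0.25 20 $sims main\" main_simulations.R main_simulations_run$run.Rout"
    run=$((run + 1))
  done
}
```

For example, `sim_batch_commands 10 100` prints ten commands with `run` values 0 through 9, each requesting 100 simulations. The printed lines can then be run one at a time or handed to a scheduler.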

Each run produces a CSV file in the `output/` directory. The `_plots.R` scripts then use those output files to generate the plots in the main text and the supplemental materials.

## Packages used

Due to server constraints, we used R version 3.5.2 to generate the simulation data from the following files:

- `main_simulations.R`
- `binary_simulations.R`

There was a change to the random number generator in R 3.6.0 that would typically cause random number sequences to differ, so we use the following command to ensure that the random sequences are consistent across versions:

``` r
if (getRversion() >= "3.6.0") RNGkind(sample.kind = "Rounding")
```

All other R code (including the generation of plots from the simulation data) was run on R version 4.0.2. 

Below is a list of the packages and versions used to generate our findings. 

- R version 4.0.2 (empirical applications, plots), version 3.5.2 (simulation generation)
- `glmnet` version 2.0.18
- `sandwich` version 3.0.0
- `KRLS` version 1.0.0
- `BART` version 2.9
- `MASS` version 7.3.51.6
- `boot` version 1.3.25
- `parallel` version 4.0.2
- `ggplot2` version 3.3.2
- `lfe` GitHub repo sgaure/lfe commit `0bba88154c3f4b2d214d9147923375d07e8ca4e5`
- `RColorBrewer` version 1.1.2
- `lmtest` version 0.9.38
- `haven` version 2.3.1
- `foreign` version 0.8.80
- `plyr` version 1.8.6
- `inters` GitHub repo mblackwell/inters commit `0dee552e2921feb7d86d6a5f08fa51231553ef06`
